ABSTRACT

Fault-tolerance techniques enable systems to continue carrying out their tasks in the presence of faults. A checkpoint is a local state of a process saved on stable storage. In a distributed system, since the processes do not share memory, a global state of the system is defined as a combination of local states, one from each process. In case of a fault in a distributed system, checkpointing enables the execution of a program to be resumed from a previous consistent global state rather than from the beginning. In this way, the amount of useful computation lost because of the fault is appreciably reduced. In this paper, we discuss various issues related to checkpointing in distributed systems and mobile computing environments. We also review the main types of checkpointing: coordinated checkpointing, asynchronous checkpointing, communication-induced checkpointing, and message-logging-based checkpointing. Finally, we present a survey of some checkpointing algorithms for distributed systems.

Keywords: Checkpointing algorithms; parallel and distributed computing; rollback recovery; fault-tolerant systems.
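
As an illustration of the notion of a consistent global state used above, the following minimal Python sketch checks whether a set of local checkpoints, one per process, is consistent in the usual sense: no checkpoint records the receipt of a message whose send is not recorded by the sender's checkpoint (a so-called orphan message). The names Message and is_consistent, and the representation of a checkpoint as a count of local events, are assumptions made for this example only; they are not taken from any particular algorithm surveyed in this paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    """Hypothetical record of one message exchanged between two processes."""
    sender: str
    receiver: str
    send_event: int  # position of the send in the sender's local history
    recv_event: int  # position of the receive in the receiver's local history


def is_consistent(checkpoints, messages):
    """Return True if the global state has no orphan message.

    checkpoints maps each process name to the number of local events
    covered by its checkpoint (an assumed, simplified representation).
    """
    for m in messages:
        receive_recorded = m.recv_event <= checkpoints[m.receiver]
        send_recorded = m.send_event <= checkpoints[m.sender]
        if receive_recorded and not send_recorded:
            return False  # orphan: received in the state, but never sent
    return True


# P1 sends a message as its 3rd event; P2 receives it as its 2nd event.
messages = [Message("P1", "P2", send_event=3, recv_event=2)]

consistent_cut = {"P1": 3, "P2": 2}    # both send and receive are recorded
inconsistent_cut = {"P1": 2, "P2": 2}  # receive recorded, send is not
```

With these assumptions, is_consistent(consistent_cut, messages) holds, whereas is_consistent(inconsistent_cut, messages) does not: rolling back to the second state would leave P2 having received a message that P1, after recovery, has not yet sent.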